Abstract

This paper addresses the problem of automatically localizing dominant
objects as spatio-temporal tubes in a noisy collection of videos with
minimal or even no supervision. We formulate the problem as a
combination of two complementary processes: discovery and tracking. The
first one establishes correspondences between prominent regions across
videos, and the second one associates successive similar object regions
within the same video. Interestingly, our algorithm also discovers the
implicit topology of frames associated with instances of the same object
class across different videos, a role normally left to supervisory
information in the form of class labels in conventional image and video
understanding methods. Indeed, as demonstrated by our experiments, our
method can handle video collections featuring multiple object classes,
and substantially outperforms the state of the art in colocalization,
even though it tackles a broader problem with much less supervision.

1. Introduction 

Visual learning and interpretation is traditionally formulated as a
supervised classification problem, with manually selected bounding boxes
acting as a (strong) supervisory signal [7, 9]. To reduce human effort
and subjective biases in manual annotation, recent work has addressed
the discovery and localization of objects from weakly-annotated or even
unlabelled datasets [4, 5, 8, 26, 28]. However, this task is difficult,
and most approaches today still lag significantly behind
strongly-supervised methods. With the ever growing popularity of video
sharing sites such as YouTube, recent research has started to address
the same task in videos [15, 23, 25, 33], and has shown that exploiting
the space-time structure of the world, which is absent in static images
(e.g., motion information), may be crucial for achieving object
discovery or localization with less supervision.

This paper addresses the problem of spatio-temporal object localization
in videos with minimal or even no supervision. Given a noisy collection
of videos with multiple object classes, dominant objects are identified
as spatio-temporal tubes for each video (Fig. 1). We formulate the
problem as a combination of two complementary processes: object
discovery and tracking. In our daily experience, salient motion often
primes us to recall similar visual patterns as an object from our
memory, and such recalled patterns help us to localize the object over
time. Likewise, object discovery, whose aim is to establish
correspondences between regions depicting similar objects in frames of
different videos, is closely connected to object tracking, whose aim is
to associate target objects in consecutive video frames. Building upon
recent advances in efficient matching [4] and tracking [22], we combine
region matching across different videos and region tracking within each
video into a joint optimization framework. We demonstrate that the
proposed method substantially outperforms the state of the art in
colocalization [15] on the YouTube-Object dataset, even though it
tackles a broader problem with much less supervision.

1.1. Related work 

Our approach combines object discovery and tracking. The discovery part
establishes correspondences between frames across videos to detect
object candidates. Similar approaches have been proposed for salient
region detection [16], image cosegmentation [31, 32], and image
colocalization [4]. Conventional object tracking methods [35] usually
require annotations for at least one frame [12, 14, 34], or object
detectors trained for target classes in a supervised manner [1, 2, 22].
Our method does not require such supervision and instead alternates
discovery and tracking of object candidates.

The problem we address is closely related to video object
colocalization [15, 23], whose goal is to localize the common object in
a video collection. Prest et al. [23] generate spatio-temporal tubes of
object candidates, and select one of these per video through energy
minimization. Since the candidate tubes rely only on clusters of point
tracks [3], this approach is not robust against noisy tracks and
incomplete clusters. Joulin et al. [15] extend the image colocalization
framework [28] for videos using an efficient optimization approach. This
method does not explicitly consider correspondences between frames from
different videos, which are shown to be essential for robust
localization of common objects in our experiments of Section 5.3.

Figure 1. Given a noisy collection of videos, dominant objects are
automatically localized as spatio-temporal tubes. The discovery process
establishes correspondences between prominent regions across videos
(left), and the tracking process associates similar object regions
within the same video (right). (Best viewed in color.)

Our setting is also related to object segmentation or cosegmentation in
videos. For video object segmentation, clusters of long-term point
tracks have been used [3, 19, 20], under the assumption that points
from the same object have similar tracks. In [17, 21], appearances of
the potential object and background are modeled and combined with
motion information for the task. These methods produce results for
individual videos and do not investigate relationships between videos
and the objects they contain. Video object cosegmentation aims to
segment a detailed mask of the common object out of the videos. This
problem has been addressed with weak supervision such as an object
class per video [29] and additional labels for a few frames that
indicate whether the frames contain the target object or not [33].

1.2. Proposed approach 

We consider a set of videos $v$, each consisting of $T$ frames (images)
$v_t$ ($t = 1, \ldots, T$), and denote by $R(v_t)$ a set of candidate
regions identified in $v_t$ by some separate bottom-up proposal process
[18]. Every region proposal is represented by a box in this paper. We
also associate with $v_t$ a matching neighborhood $N(v_t)$ formed by the
$k$ closest frames $w_u$ among all videos $w \neq v$, according to a
robust criterion based on probabilistic Hough matching (see [4] and
Section 2.1). The network structure defined by $N$ links frames across
different videos (Fig. 1, left). We also link regions in successive
frames of the same video, so that $r_t$ in $R(v_t)$ and $r_{t+1}$ in
$R(v_{t+1})$ are tracking neighbors when there exists some point track
originating in $r_t$ and terminating in $r_{t+1}$ (Fig. 1, right). A
spatio-temporal tube is any sequence $r = [r_1, \ldots, r_T]$ of
temporal neighbors in the same video. Our goal is to find, for every
video $v$ in the input collection, the top tube $r$ according to the
criterion

$$\Omega_v(r) = \sum_{t=1}^{T} \varphi[r_t, v_t, N(v_t)] + \lambda \sum_{t=1}^{T-1} \psi(r_t, r_{t+1}), \tag{1}$$

where $\varphi[r_t, v_t, N(v_t)]$ is a measure of confidence for $r_t$
being an object (foreground) region, given $v_t$ and its matching
neighbors, and $\psi(r_t, r_{t+1})$ is a measure of temporal consistency
between $r_t$ and $r_{t+1}$.

As will be shown in the sequel, given the matching network structure
$N$, finding the top tube (or for that matter the top $p$ tubes) for
each video can be done efficiently using dynamic programming. Note that
both the matching and tracking network structures are a priori fixed.
However, the matching network is huge, every frame in a video being a
priori linked to all other frames in all other videos, and, as will be
shown in Section 2.1, computing the matching score between two frames is
itself nontrivial. We therefore choose instead to use an iterative
process, alternating between steps where $N$ is fixed and the top $p$
tubes are computed for each video, and steps where the top $p$ tubes are
fixed and used to update the matching network. After a few iterations,
we stop, and finally pick the top scoring tube for each video. We dub
this iterative process a discovery and tracking procedure since finding
the tubes maximizing foreground confidence across videos is akin to
unsupervised object discovery [4, 10, 11, 24, 27], whereas finding the
tubes maximizing temporal consistency within a video is similar to
object tracking [1, 2, 12, 22, 34, 35].

Interestingly, because we update the matching neighborhood structure at
every iteration, our discovery and tracking procedure does much more
than finding the spatio-temporal tubes associated with dominant objects:
It also discovers the implicit neighborhood structure of frames
associated with instances of the same class, which is a role normally
left to supervisory information in the form of class labels in
conventional image and video understanding methods. Indeed, as
demonstrated by our experiments, our method can handle video collections
featuring multiple object classes with minimal or zero supervision (it
is, however, limited for the time being to one object instance per
frame).

We describe in the next two sections our foreground confidence and
temporal consistency terms of Eq. (1), before describing in Section 4
our discovery and tracking algorithm, presenting experiments in
Section 5, and concluding in Section 6 with brief remarks about future
work.

2. Foreground confidence 

Our foreground confidence term is defined as a weighted sum of
appearance- and motion-based confidences:

$$\varphi[r_t, v_t, N(v_t)] = \varphi_a[r_t, v_t, N(v_t)] + \alpha\, \varphi_m(r_t). \tag{2}$$

For the appearance-based term, denoted by $\varphi_a$, we follow [4] and
use a standout score based on region matching confidence. For the
motion-based term, denoted by $\varphi_m$, we build on long-term point
track clusters [3] and propose a motion coherence score that measures
how well the box region aligns with motion clusters.

2.1. Appearance-based confidence 

Foreground object regions are likely to match each other across videos
with similar objects, and a region tightly bounding a foreground object
stands out over the background. Recent work on unsupervised object
discovery in image collections [4] implements this concept through a
standout score based on a region matching algorithm, called
probabilistic Hough matching (PHM). Here we extend the idea to video
frames.

PHM is an efficient region matching algorithm which generates scores for
region matches using appearance and geometric consistency. Assume two
sets of region proposals have been extracted from $v_t$ and $v_u$:
$R_t = R(v_t)$ and $R_u = R(v_u)$. Let $r_t = (f_t, l_t) \in R_t$ be a
region with its 8 × 8 HOG descriptor $f_t$ [6, 13] and its location
$l_t$, i.e., position and scale. The score for match $m = (r_t, r_u)$ is
decomposed into an appearance term $m_a = (f_t, f_u)$ and a geometry
term $m_g = (l_t, l_u)$. Let $x$ denote the location offset of a
potential object common to $v_t$ and $v_u$. Given $R_t$ and $R_u$, PHM
evaluates the match score $c(m|R_t, R_u)$ by combining the Hough space
vote $h(x|R_t, R_u)$ and the appearance similarity in a
pseudo-probabilistic way:

$$c(m|R_t, R_u) = p(m_a) \sum_{x} p(m_g|x)\, h(x|R_t, R_u), \tag{3}$$

$$h(x|R_t, R_u) = \sum_{m} p(m_a)\, p(m_g|x), \tag{4}$$

where $p(m_a)$ is the appearance-based similarity between two
descriptors $f_t$ and $f_u$, and $p(m_g|x)$ is the likelihood of
displacement $l_t - l_u$, which is defined as a Gaussian distribution
centered on $x$. As noted in [4], this can be seen as a combination of
bottom-up Hough space voting (Eq. (4)) and top-down confidence
evaluation (Eq. (3)). Given neighbor frames $N(v_t)$ where an object in
$v_t$ may appear, the region saliency is defined as the sum of
max-pooled match scores from $R_u$ to $r_t$:

$$g(r_t|R_t, R_u) = \sum_{v_u \in N(v_t)} \max_{r_u \in R_u} c\big((r_t, r_u)\,|\,R_t, R_u\big). \tag{5}$$
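As a rough illustration of Eqs. (3)-(5), the following minimal sketch computes PHM match scores and the region saliency on toy data. It is not the implementation of [4]: the function names, the similarity exp(-||f_t - f_u||^2), the bandwidth `sigma`, the unnormalized Gaussian, and the choice of sampling the Hough space at the offsets of the candidate matches are all assumptions made for the example.

```python
import math


def phm_scores(regions_t, regions_u, sigma=1.0):
    """Return c[i][j], the match confidence of regions_t[i] and regions_u[j].

    Each region is a (descriptor, location) pair of equal-length float tuples.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Enumerate candidate matches m = (i, j) with their appearance
    # similarity p(m_a) and their location offset l_t - l_u.
    matches = []
    for i, (f_t, l_t) in enumerate(regions_t):
        for j, (f_u, l_u) in enumerate(regions_u):
            p_a = math.exp(-sq_dist(f_t, f_u))  # assumed similarity measure
            offset = tuple(a - b for a, b in zip(l_t, l_u))
            matches.append((i, j, p_a, offset))

    def p_g(offset, x):
        # Unnormalized Gaussian likelihood of a displacement given offset x.
        return math.exp(-sq_dist(offset, x) / (2.0 * sigma ** 2))

    # Assumption: the Hough space is sampled at the offsets of the matches.
    bins = [m[3] for m in matches]
    # Eq. (4): bottom-up votes h(x) = sum_m p(m_a) p(m_g | x).
    votes = [sum(p_a * p_g(off, x) for _, _, p_a, off in matches) for x in bins]

    # Eq. (3): top-down confidence c(m) = p(m_a) sum_x p(m_g | x) h(x).
    c = [[0.0] * len(regions_u) for _ in regions_t]
    for i, j, p_a, off in matches:
        c[i][j] = p_a * sum(p_g(off, x) * h for x, h in zip(bins, votes))
    return c


def region_saliency(regions_t, neighbor_region_sets, sigma=1.0):
    """Eq. (5): sum over neighbor frames of the max-pooled match score."""
    g = [0.0] * len(regions_t)
    for regions_u in neighbor_region_sets:
        c = phm_scores(regions_t, regions_u, sigma)
        for i, row in enumerate(c):
            g[i] += max(row) if row else 0.0
    return g
```

In this sketch, two matches that agree on the same offset reinforce each other through the votes, so a geometrically consistent match receives a higher score than an inconsistent one with the same appearance similarity.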

We omit the given terms $R_t$ and $R_u$ in function $g$ for brevity in
the following. The region saliency $g(r_t)$ is high when $r_t$ matches
the neighbor frames well in terms of both appearance and geometric
consistency. While useful as evidence for foreground regions, the region
saliency of Eq. (5) may be higher on a part than on a whole object
because part regions often match more consistently than entire object
regions. To counteract this effect, a standout score measures how much
the region $r_t$ “stands out” from its potential backgrounds in terms of
region saliency:

$$s(r_t) = g(r_t) - \max_{r_b \in B(r_t)} g(r_b), \quad \text{s.t. } B(r_t) = \{ r_b \mid r_t \subset r_b,\ r_b \in R_t \}, \tag{6}$$

where $r_t \subset r_b$ indicates that region $r_t$ is contained in
region $r_b$. As can be seen from Eq. (5), the standout score $s(r_t)$
evaluates a foreground likelihood of $r_t$ based on region matching
between frame $v_t$ and its neighbor frames $N(v_t)$. Now we denote it
more explicitly using $s(r_t|v_t, N(v_t))$. The appearance-based
foreground confidence for region $r_t$ is defined as the standout score
of $r_t$:

$$\varphi_a[r_t, v_t, N(v_t)] = s(r_t|v_t, N(v_t)). \tag{7}$$

In practice, we rescale standout scores to cover [0, 1] at each frame.
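The standout score of Eq. (6) and the per-frame rescaling can be sketched as follows, assuming the region saliencies have already been computed. The helper names and the box format (x0, y0, x1, y1) are hypothetical, and the handling of a region that no other region contains is an assumption, since the text does not specify it.

```python
def contains(outer, inner):
    """True if box `inner` lies inside box `outer` and the two boxes differ."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return (outer != inner and ox0 <= ix0 and oy0 <= iy0
            and ix1 <= ox1 and iy1 <= oy1)


def standout_scores(boxes, saliency):
    """Eq. (6): s(r) = g(r) - max of g(r_b) over regions r_b containing r."""
    scores = []
    for i, box in enumerate(boxes):
        background = [saliency[j] for j, other in enumerate(boxes)
                      if j != i and contains(other, box)]
        # Assumption: a region without any containing region keeps g(r).
        scores.append(saliency[i] - max(background) if background
                      else saliency[i])
    return scores


def rescale01(values):
    """Linearly rescale values to cover [0, 1] (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For instance, with a whole-frame box of saliency 1, an object box of saliency 3 inside it, and a part box of saliency 4 inside the object box, the standout scores are 1, 2 and 1: the part has the highest saliency, but the object box stands out most.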

2.2. Motion-based confidence 

Figure 2. Motion-based region confidence. (a) Given a video clip, its
motion clusters are computed for each frame [3]. The example shows a
frame (left) and its motion clusters with color coding (right). (b)
Given a box region (yellow), the motion coherence score for the box is
computed in three steps: box-boundary binning (left), cluster weighting
(middle), and edge-wise max pooling (right). For the details, see text.
(c) Heat map of the motion coherence scores (left) and the top 5 box
regions with the best scores (right). (Best viewed in color.)

Motion is an important cue for localizing moving objects in videos and
differentiating them from the background [21]. To exploit this
information, we propose the motion coherence score as another foreground
confidence measure, which is built on clusters of long-term point tracks
[3]. Since the motion clusters incorporate long-term spatio-temporal
coherence, they are more “global” than conventional optical flows and
long-term tracks. Using the motion clusters, we compute the motion
coherence score for a box region in three steps: (1) edge motion
binning, (2) motion cluster weighting, and (3) edge-wise max pooling.
First, we divide a box region into 5 × 5 cells, and construct bins along
its edges as illustrated in Fig. 2. Then, for each bin $b$, we assign
its cluster label by majority voting using the tracks that fall into the
bin. Second, we compute a weight for each motion cluster $l$:

$$w(l) = \frac{\#\ \text{of tracks of cluster } l \text{ within the box}}{\#\ \text{of all tracks of cluster } l \text{ in the frame}}. \tag{8}$$
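The three steps listed above (edge binning, cluster weighting, edge-wise max pooling) can be sketched as below. This is only an illustration under stated assumptions: tracks are given as (x, y, cluster_label) points of a single frame, the edge bins are taken to be the outer row or column of the 5 × 5 cells, and a bin without any track contributes a weight of 0. The function name and argument layout are hypothetical.

```python
from collections import Counter


def motion_coherence(box, tracks, grid=5):
    """Score a box (x0, y0, x1, y1) from track points (x, y, cluster_label)."""
    x0, y0, x1, y1 = box
    inside = [(x, y, l) for x, y, l in tracks if x0 <= x < x1 and y0 <= y < y1]
    in_frame = Counter(l for _, _, l in tracks)
    in_box = Counter(l for _, _, l in inside)
    # Step 2: fraction of each cluster's tracks that fall inside the box.
    weight = {l: in_box[l] / in_frame[l] for l in in_box}

    # Step 1: majority vote for a cluster label in every occupied cell.
    cell_votes = {}
    for x, y, l in inside:
        col = min(int((x - x0) / (x1 - x0) * grid), grid - 1)
        row = min(int((y - y0) / (y1 - y0) * grid), grid - 1)
        cell_votes.setdefault((row, col), Counter())[l] += 1
    label = {cell: votes.most_common(1)[0][0]
             for cell, votes in cell_votes.items()}

    # Assumption: the bins of an edge are the outer row or column of cells.
    edges = {
        "L": [(r, 0) for r in range(grid)],
        "R": [(r, grid - 1) for r in range(grid)],
        "T": [(0, c) for c in range(grid)],
        "B": [(grid - 1, c) for c in range(grid)],
    }
    # Step 3: max-pool the weights along each edge and sum over the edges.
    # Assumption: an edge whose bins hold no track contributes 0.
    return sum(
        max((weight[label[b]] for b in bins if b in label), default=0.0)
        for bins in edges.values()
    )
```

With this sketch, a box whose four edges each touch a cluster that lies entirely inside the box reaches the maximum score of 4, while a box that is too loose or that cuts the cluster scores lower.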


The weight of Eq. (8) evaluates how much of the motion cluster the box
includes, compared to the entire frame. The weight is assigned to the
corresponding bin, and suppresses the effect of background clusters in
the bins. Third, we select the bin with the maximum cluster weight along
each edge, and define the sum of the weights as the motion coherence
score for the box:

$$\varphi_m(r_t) = \sum_{e \in \{L, R, T, B\}} \max_{b \in E_e} w(l_b), \tag{9}$$

where $e$ represents one of the four edges of the box region (left,
right, top, bottom), $E_e$ the set of bins on the edge, and $l_b$ the
cluster label of bin $b$. This score is designed to be high for a box
region that is in contact with motion cluster boundaries (edge-wise max
pooling) and contains the entire clusters (motion cluster weighting).
Note that in most cases an object does not fill the entire area of its
bounding box, but only touches the four edges (e.g., round objects). On
this account, edge-wise max pooling provides a more robust score than
average pooling on entire cells. If the box does not touch any motion
cluster boundary, the score becomes small since some tracks of pooled
clusters lie outside of the box. This motion coherence score is useful
to discover moving objects in video frames, and acts as a complementary
cue to the standout score in Section 2.1.

3. Temporal consistency

Regions with high foreground confidences may turn out to be temporally
inconsistent. They can be misaligned due to imperfect confidence
measures and ambiguous observations. Also, given multiple object
instances of the same category, foreground regions may correspond to
different instances in a video. Our temporal consistency term is used to
handle these issues so that selected spatio-temporal tubes are more
stable and consistent temporally. We exploit both appearance- and
motion-based evidence for this purpose. We denote by
$\psi_a(r_t, r_{t+1})$ and $\psi_m(r_t, r_{t+1})$ the appearance- and
motion-based terms, respectively. The consistency term of Eq. (1) is
obtained as

$$\psi(r_t, r_{t+1}) = \psi_a(r_t, r_{t+1}) + \psi_m(r_t, r_{t+1}). \tag{10}$$

We describe these terms in the following subsections.

3.1. Appearance-based consistency

We use appearance similarity between two consecutive regions as a
temporal consistency term. Region $r_t$ is described by an 8 × 8 HOG
descriptor $f_t$, as in Section 2.1, and the appearance-based
consistency is defined as the negative of the distance between
descriptors:

$$\psi_a(r_t, r_{t+1}) = -\| f_t - f_{t+1} \|_2, \tag{11}$$

which is rescaled in practice to cover [0, 1] at each frame.

3.2. Motion-based consistency 

Two consecutive regions $r_t$ and $r_{t+1}$ associated with the same
object typically share the same point tracks, and configurations of the
points in the two regions should be similar. Long-term point tracks [3]
provide correspondences for such points across frames, which we exploit
to measure the motion-based consistency between a pair of regions.

Figure 3. Motion-based temporal consistency. We compare two sets of
corresponding points in consecutive regions by transforming them into a
unit square from the regions. The configurations of points do not align
with each other unless the two regions match well (e.g., black and
green). The motion-based consistency uses the sum of distances between
the corresponding points in the transformed domain. If two regions share
no point track, we assign a constant $\theta$ as the consistency term.
(Best viewed in color.)

To compare the configurations of shared point tracks, we linearly
transform each box region and internal point coordinates into a unit
square with edge length 1, as illustrated in Fig. 3. Using the
transformed coordinates, we can compare the point configurations up to
affine variation between the regions. Let $p$ be an individual point
track and $p_t$ the coordinate of $p$ at frame $t$. Then, the coordinate
of $p$ transformed by region $r_t$ is denoted by $\tau(p_t|r_t)$. If two
consecutive regions $r_t$ and $r_{t+1}$ cover the same object and share
a point track $p$, $\tau(p_t|r_t)$ and $\tau(p_{t+1}|r_{t+1})$ should be
close to each other. The motion-based consistency
$\psi_m(r_t, r_{t+1})$ reflects this observation. Let $P_{r_t}$ be the
set of points occupied by region $r_t$. The motion-based consistency is
defined as

$$\psi_m(r_t, r_{t+1}) = -\frac{\sum_{p \in P_{r_t} \cap P_{r_{t+1}}} \| \tau(p_t|r_t) - \tau(p_{t+1}|r_{t+1}) \|_2}{| P_{r_t} \cap P_{r_{t+1}} |}. \tag{12}$$

If $r_t$ and $r_{t+1}$ share no point track, we assign a constant value
$\psi_m(r_t, r_{t+1}) = \theta$, which is smaller than -1, the minimum
value of $\psi_m(r_t, r_{t+1})$, to penalize transitions between regions
having no point correspondence. This is a bit more inclusive than
described in Section 1.2, for added robustness. Through this consistency
term, we can measure variations in spatial position, aspect ratio, and
scale between regions at the same time.
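The two consistency terms can be sketched in a few lines. The sketch below assumes axis-aligned boxes (x0, y0, x1, y1), a dictionary of point tracks giving each track's position in the two frames, and the Euclidean distance in the unit square; the function names and the default `theta=-2.0` (the value reported in Section 5.1) are choices made for this example.

```python
import math


def to_unit_square(point, box):
    """Map a point to the coordinates of the unit square spanned by box."""
    x0, y0, x1, y1 = box
    return ((point[0] - x0) / (x1 - x0), (point[1] - y0) / (y1 - y0))


def inside(point, box):
    x0, y0, x1, y1 = box
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def motion_consistency(box_t, box_next, tracks, theta=-2.0):
    """Eq. (12). `tracks` maps a track id to (point at t, point at t + 1)."""
    shared = [(p, q) for p, q in tracks.values()
              if inside(p, box_t) and inside(q, box_next)]
    if not shared:
        return theta  # constant penalty when no point track is shared
    total = sum(math.dist(to_unit_square(p, box_t), to_unit_square(q, box_next))
                for p, q in shared)
    return -total / len(shared)


def appearance_consistency(f_t, f_next):
    """Eq. (11): negative Euclidean distance between region descriptors."""
    return -math.dist(f_t, f_next)
```

When the second box follows the object exactly, the transformed points coincide and the motion-based term is 0, its maximum; a box that drifts relative to the tracked points yields a negative value.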


4. Discovery and tracking algorithm 

We initialize each tube $r$ as an entire video (a sequence of entire
frames), and alternate between (1) updating the neighborhood structure
across videos and (2) optimizing $\Omega_v(r)$ within each video. The
intuition is that better object discovery may lead to more accurate
object tracking, and vice versa. These two steps are repeated for a few
iterations until (near-) convergence. In our experiments, using more
than 5 iterations does not improve performance. The number of neighbors
for each frame is fixed as $k = 10$. The final result is obtained by
selecting the best tube for each video at the end. As each video is
independently processed at each iteration, the algorithm is easily
parallelized.

Network update. Given a localized tube $r$ fixed for each video, we
update the neighborhood structure $N$ by $k$ nearest neighbor retrieval
for each localized object region. At the first iteration, the nearest
neighbor search is based on distances between GIST descriptors [30] of
frames, as the tube $r$ is initialized as the entire video. From the
second iteration, the metric is defined as the appearance similarity
between potential object regions localized at the previous iteration.
Specifically, we select the top 20 region proposals inside the potential
object regions according to region saliency (Eq. (5)), and perform PHM
between those small sets of regions. The similarity is then computed as
the sum of all region saliency scores given by the matching. This
selective region matching procedure allows us to perform efficient and
effective retrieval for video frames.

Object relocalization. Given the neighborhood structure $N$, we optimize
the objective of Eq. (1) for each video $v$. To exploit the tubes
localized at the previous iteration, we confine region proposals in
neighbor frames to those contained in the localized tube of the frames.
This is done in Eq. (7) by substituting the neighbor frames of each
frame $v_t$ with the regions $r_u$ localized in the frames: set
$w_u = r_u$ for all $w_u$ in $N(v_t)$. Before the optimization, we
compute foreground confidence scores of region proposals, and select the
top 100 among these according to their confidence scores. Only the
selected regions are considered during optimization for efficiency. The
objective of Eq. (1) is then efficiently optimized by dynamic
programming (DP) [22]. Note that using the $p$ best tubes ($p = 5$ in
all our experiments) for each video at each iteration except the last
one, instead of retaining only one candidate at each iteration,
increases the robustness of our approach. This agrees with the
conclusions of [4] in the still image domain, and has also been
confirmed empirically by our experiments. We obtain the $p$ best tubes
by sequential DPs, which iteratively remove the best tube and re-run
DP.¹
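For a chain-structured objective such as Eq. (1), the best tube can be found with a standard Viterbi-style recursion. The sketch below is a generic illustration rather than the implementation of [22]: `best_tube`, `unary`, `pairwise` and `lam` are hypothetical names, the scores are assumed to be precomputed, and only the single best tube is returned (the sequential DP for the $p$ best tubes is not shown).

```python
def best_tube(unary, pairwise, lam=2.0):
    """Maximize the sum of unary scores plus lam times pairwise scores.

    unary[t][i]       foreground confidence of region i in frame t
    pairwise(t, i, j) temporal consistency between region i of frame t
                      and region j of frame t + 1
    Returns the selected region index per frame and the total score.
    """
    score = list(unary[0])  # best score of a partial tube ending at each region
    back = []               # back[t - 1][j]: best predecessor of region j at t
    for t in range(1, len(unary)):
        new_score, pointers = [], []
        for j, u in enumerate(unary[t]):
            cands = [score[i] + lam * pairwise(t - 1, i, j)
                     for i in range(len(score))]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            new_score.append(cands[i_best] + u)
            pointers.append(i_best)
        score = new_score
        back.append(pointers)

    # Backtrack from the best region of the last frame.
    j = max(range(len(score)), key=score.__getitem__)
    total = score[j]
    tube = [j]
    for pointers in reversed(back):
        j = pointers[j]
        tube.append(j)
    tube.reverse()
    return tube, total
```

The cost is linear in the number of frames and quadratic in the number of regions kept per frame, which is one reason to prune the proposals before optimization as described above.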

5. Implementation and results 

Our method is evaluated on the YouTube-Object dataset [23], which
consists of videos downloaded from YouTube by querying for 10 object
classes from PASCAL VOC [9]. Each video of the dataset comes from a
longer video and is segmented by automatic shot boundary detection. This
dataset is challenging since the videos involve large camera motions,
view-point changes, encoding artifacts, editing effects, and incorrect
shot boundaries. Ground-truth boxes are given for a subset of the
videos, and one frame is annotated per video for evaluation. Following
[15], our experiments are conducted on all the annotated videos.

We demonstrate the effectiveness of our method through various
experiments. First, we evaluate our method in the weakly-supervised
colocalization setting, where all videos contain at least one object of
the same category. Our method is also tested in a fully unsupervised
mode, where all videos from all classes of the dataset are mixed; we
call this challenging setting unsupervised object discovery.

¹ It has been empirically shown in multi-target tracking that sequential
DP performs close to the global optimum with greater efficiency than the
optimal algorithm [22].

5.1. Implementation details 

Key frame selection. We sample key frames from each video uniformly with
stride 20, and our method is used only on the key frames. This is
because temporally adjacent frames typically have redundant information,
and it is time-consuming to process all the frames. Note that long-term
point tracks enable us to utilize continuous motion information although
our method works on temporally sparse key frames. To obtain temporally
dense localization results, object regions in non-key frames are
estimated by interpolating localized regions in temporally adjacent key
frames.
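A minimal sketch of this densification step is given below. The text only says that regions are interpolated; linear interpolation of the box coordinates is an assumption, as are the function name and the box format (x0, y0, x1, y1).

```python
def interpolate_boxes(key_boxes, stride=20):
    """Linearly interpolate boxes between consecutive key frames.

    key_boxes[i] is the box localized in frame i * stride. The result holds
    one box per frame, from the first to the last key frame.
    """
    if not key_boxes:
        return []
    dense = []
    for a, b in zip(key_boxes, key_boxes[1:]):
        for s in range(stride):
            w = s / stride  # 0 at key frame a, approaching 1 towards b
            dense.append(tuple((1 - w) * p + w * q for p, q in zip(a, b)))
    dense.append(tuple(float(v) for v in key_boxes[-1]))
    return dense
```

With two key frames and stride 20, the result contains 21 boxes, and the box halfway between the key frames is the average of the two key-frame boxes.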

Parameter setting. The weight for the motion-based confidence $\alpha$
and that for the temporal consistency terms $\lambda$ are set to 0.5 and
2, respectively. To penalize transitions between regions sharing no
point track, $\theta$ is set to -2, smaller than the minimum value of
$\psi_m(r_t, r_{t+1})$ when two regions share points. The parameters are
fixed for all experiments.

5.2. Evaluation metrics 

Our method not only discovers and localizes objects, but also reveals
the topology between different videos and the objects they contain. We
evaluate our results on those two tasks with different measures.

Localization accuracy is measured using CorLoc [15, 21, 23], which is
defined as the percentage of images correctly localized according to the
PASCAL criterion:
$\frac{\mathrm{area}(r_p \cap r_{gt})}{\mathrm{area}(r_p \cup r_{gt})} > 0.5$,
where $r_p$ is the predicted region and $r_{gt}$ is the ground truth.
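The CorLoc computation can be sketched as follows, assuming one predicted box and one ground-truth box per annotated frame, both given as (x0, y0, x1, y1). The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def corloc(predicted, ground_truth, threshold=0.5):
    """Percentage of frames whose predicted box passes the PASCAL criterion."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(predicted)
```

Note that the criterion is strict: a prediction that overlaps the ground truth by exactly half of the union does not count as correct.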

In the unsupervised object discovery setting, we measure the quality of
the topology revealed by our method as well as localization performance.
To this end, we first employ the CorRet metric, originally introduced in
[4], which is defined in our case as the mean percentage of retrieved
nearest neighbor frames that belong to the same class as the target
video. We also measure the accuracy of nearest neighbor classification,
where a query video is classified by the most frequent labels of its
neighbor frames retrieved by our method. The classification accuracy is
reported by the top-k error rate, which is the percentage of videos
whose ground-truth labels do not belong to the k most frequent labels of
their neighbor frames. All the evaluation metrics are given as
percentages.
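Both retrieval measures can be sketched from the class labels of the retrieved neighbor frames. The sketch assumes one list of neighbor labels per query (how frames are aggregated per video is not specified here), and the function names are hypothetical.

```python
from collections import Counter


def corret(neighbor_labels, query_labels):
    """Mean percentage of retrieved neighbors sharing the query's class."""
    ratios = [sum(l == q for l in nbrs) / len(nbrs)
              for nbrs, q in zip(neighbor_labels, query_labels)]
    return 100.0 * sum(ratios) / len(ratios)


def top_k_error(neighbor_labels, query_labels, k=1):
    """Percentage of queries whose label is not among the k most frequent
    labels of their neighbors."""
    errors = 0
    for nbrs, q in zip(neighbor_labels, query_labels):
        top = [label for label, _ in Counter(nbrs).most_common(k)]
        errors += q not in top
    return 100.0 * errors / len(query_labels)
```

For example, a “horse” query with neighbors labeled horse, cow, cow, cow counts as a top-1 error but not as a top-2 error.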

5.3. Object colocalization per class 

We compare our method with two colocalization methods for videos
[15, 23]. We also compare our method with several of its variants to
highlight the benefits of each of its components. Specifically, the
components of our method are denoted by combinations of four characters:
‘F’ for foreground confidence, ‘T’ for temporal consistency, ‘A’ for
appearance, and ‘M’ for motion. For example, F(A) means foreground
saliency based only on appearance (i.e., $\varphi_a$), and T(A,M)
indicates temporal smoothness based on both appearance and motion (i.e.,
$\psi_a + \psi_m$). Our full model corresponds to F(A,M)+T(A,M).

Figure 4. Average CorLoc scores (left) and average overlap ratios
(right) versus iterations on the YouTube-Object dataset in the
colocalization setting.

Quantitative results are summarized in Table 1. Our method outperforms
the previous state of the art in [15] on the same dataset by a
substantial margin. Comparing our full method to its simpler versions,
we observe that performance improves by adding each of the temporal
consistency terms. The motion-based confidence can damage performance
when motion clusters include only a part of the object (e.g., “bird”,
“dog”) and/or the background has distinctive clusters due to complex 3D
structures (e.g., “car”, “motorbike”). However, it enhances localization
when the object is highly non-rigid (e.g., “cat”) and/or is clearly
separated from the background by motion (e.g., “aeroplane”, “boat”). In
the “train” class, where our method without motion-based confidence
often localizes only a part of long trains, the motion-based confidence
significantly improves localization accuracy. Fig. 4 illustrates the
performance of our method over iterations. Our full method performs
better than its variants at every iteration, and improves both the
CorLoc score and the overlap ratio most quickly in early stages.

Sample qualitative results are shown in Figs. 5 and 6, where the regions
localized by our full model are compared with those of F(A), which
relies only on image-based information. F(A) already outperforms the
previous state of the art, but its results are often temporally
inconsistent when the object undergoes severe pose variation or multiple
target objects exist in a video. We handle this problem by enforcing
temporal consistency on the solution.

Table 1. CorLoc scores on the YouTube-Object dataset.

Method                       aeroplane  bird  boat   car   cat   cow   dog  horse  motorbike  train  Avg.
Prest et al. [23]                 51.7  17.5  34.4  34.7  22.3  17.9  13.5   26.7       41.2   25.0  28.5
Joulin et al. [15]                25.1  31.2  27.8  38.5  41.2  28.4  33.9   35.6       23.1   25.0  31.0
F(A)²                             38.2  67.3  30.4  75.0  28.6  65.4  38.3   46.9       52.0   25.9  46.8
F(A)+T(M)                         44.4  68.3  31.2  76.8  30.8  70.9  56.0   55.5       58.0   27.6  51.9
F(A)+T(A,M)                       52.9  72.1  55.8  79.5  30.1  67.7  56.0   57.0       57.0   25.0  55.3
Ours, full³                       56.5  66.4  58.0  76.8  39.9  69.3  50.4   56.3       53.0   31.0  55.7
Brox and Malik [3]                53.9  19.6  38.2  37.8  32.2  21.8  27.0   34.7       45.4   37.5  34.8
Papazoglou and Ferrari [21]       65.4  67.3  38.9  65.2  46.3  40.2  65.3   48.4       39.0   25.0  50.1
Ours, full (unsupervised)         55.2  58.7  53.6  72.3  33.1  58.3  52.5   50.8       45.0   19.8  49.9

² Our re-implementation of PHM [4]. ³ Our full method corresponds to
F(A,M)+T(A,M).



Figure 5. Examples of objects correctly localized by our full method:
(red) our full method, (green) our method without motion information,
(yellow) ground-truth localization. The sequences come from the (a)
“aeroplane”, (b) “car”, (c) “cat”, (d) “dog”, (e) “motorbike”, and (f)
“train” classes. Frames are ordered by time from top to bottom. The
localization results of our full method are spatio-temporally
consistent. On the other hand, the simpler version often fails due to
pose variations (a, c-e) or produces inconsistent tracks when multiple
target objects exist (b). More results are included in the supplementary
material. (Best viewed in color.)


5.4. Unsupervised object discovery and tracking 

In the unsupervised setting, where videos with different object classes
are all mixed together, our method still outperforms existing video
colocalization techniques even though it does not use any supervisory
information, as summarized in Table 1. It performs slightly worse than
the state of the art in video segmentation [21], which uses a
foreground/background appearance model. Note however that (1) such a
video-specific appearance model would probably further improve our
localization accuracy; and (2) our method attacks a more difficult
problem, and, unlike [21], discovers the underlying topology of the
video collection.

The quality of nearest-neighbor retrieval is measured by CorRet and
quantified in Table 2. Even in the case where some neighbors do not come
from the same class as the query, object candidates in the neighbor
frames usually resemble those in the query frame, as illustrated in
Fig. 7. To illustrate the recovered topology between classes, we provide
a confusion matrix of the retrieval results in Fig. 8, showing that most
classes are most strongly connected to themselves, and some classes with
similar appearances (e.g., “cat”, “dog”, “cow”, and “horse”) are to some
extent connected to each other. Finally, we measure the accuracy of
nearest neighbor classification that is based on neighbor frames
provided by our method and their ground-truth labels. The classification
accuracy in top-1 and top-2 error rates is summarized in Table 2. The
error rates are low when the query class usually shows unique
appearances (e.g., “aeroplane”, “boat”, “car”, and “train”), and high
when there are other classes with similar appearances (e.g., “cat”,
“dog”, “cow”, and “horse”).

Figure 6. Examples incorrectly localized by our full method: (red) our
full method, (green) our method without motion information, (yellow)
ground-truth localization. The sequences come from (a) “aeroplane”, (b)
“bird”, (c) “car”, (d) “cow”, (e) “horse”, and (f) “motorbike”. Frames
are ordered by time from top to bottom. Our full method fails when the
background looks like an object and is spatio-temporally more consistent
than the object (a, c), or the boundaries of motion clusters include
multiple objects or background together (b, d-e). The localization
results in (b) and (f) are reasonable although they are incorrect
according to the PASCAL criterion. (Best viewed in color.)


Table 2. CorRet scores and top-k error rates of our method on the YouTube-Object dataset in the fully unsupervised setting.

Metric           aeroplane  bird  boat   car   cat   cow   dog  horse  motorbike  train  Avg.
CorRet                66.9  36.1  49.5  51.8  15.9  30.6  20.7   22.6       15.3   45.5  35.5
Top-1 error rate      12.1  51.9  34.1  25.0  84.2  45.7  70.2   73.4       83.0   33.6  51.3
Top-2 error rate       4.6  46.2  10.9  18.8  60.9  24.4  41.1   49.2       63.0   20.7  34.0




Figure 7. A query frame (bold outer box) from the “horse” class and its
nearest neighbor frames at the last iteration of the unsupervised object
discovery and tracking. The appearances of the top-5 object candidates
(inner boxes) of the nearest neighbors look similar to those of the
query, although half of them come from the “cow” class (4th, 6th, 8th,
and 9th) or the “car” class (5th).


Figure 8. Confusion matrix of nearest neighbor retrieval. Rows
correspond to query classes and columns indicate retrieved classes.
Diagonal elements correspond to the CorRet values in Table 2.

6. Discussion and Conclusion

We have proposed a novel approach to localizing objects in an unlabeled
video collection by a combination of object discovery and tracking. Not
only does our method find objects in each video, it also reveals a
network structure associating frames and objects across videos. It
alternately optimizes the localization objective and the neighborhood
structure, improving each. We have demonstrated the effectiveness of the
proposed method on the YouTube-Object dataset, where it significantly
outperforms the state of the art in colocalization even though it uses
much less supervision. Some issues still remain for further exploration.
As it stands, our method is not appropriate for videos with a single
dominant background and a highly non-rigid object (e.g., the UCF-sports
dataset). Next on our agenda is to address these issues, using for
example video stabilization and foreground/background models [17, 21].